
Work in progress: upgrade to astropy ecosystem structure #3

Open · wants to merge 46 commits into base: master

Conversation

@bmorris3 (Member) commented Nov 8, 2019

Hi @pmneila and @cefisher,

I'm a new postdoc working with Kevin Heng, nice to e-meet you @pmneila 👋. One of my projects is to bring ESP services up to compliance with astropy developer standards, and HELA is the first toolkit I'm starting to work with. I hope this PR will be the beginning of a conversation about tweaks we can make to the package. Ideally we can collaborate over GitHub, but I'm happy to meet in person if that would be helpful.

This PR is a modest overhaul of the formatting within HELA, and a reorganization of the package structure. There are a few broad goals of the PR:

  • make HELA's structure closer to the traditional Python package structure
  • make a Pythonic API that exposes HELA's data products to users in an interactive/notebook environment, rather than the legacy command-line interface (e.g., so you can make your own plots with HELA's outputs)
  • add integrations with Travis CI and Read the Docs for continuous integration testing and self-building Sphinx documentation, respectively

I've moved a copy of the present implementation and README of HELA into a directory called legacy, so the old functionality is preserved.

New API

The API now allows for a workflow that looks like this:

# Generate an example dataset directory
from hela import generate_example_data
example_dir, training_dataset, samples_path = generate_example_data()

# Import RandomForest object from HELA
from hela import RandomForest
import matplotlib.pyplot as plt

# Initialize a random forest object:
rf = RandomForest(training_dataset, example_dir, samples_path)

# Train the random forest:
r2scores = rf.train(num_trees=1000, num_jobs=1)
plt.show()

# Predict posterior distributions from random forest
posterior = rf.predict(plot_posterior=True)
posterior_slopes, posterior_intercepts = posterior.samples.T
plt.show()

The train and predict methods on the RandomForest object work just like the rfretrieval.py train and rfretrieval.py predict functions worked in the legacy version.

Documentation

There's now a docs directory containing self-building Sphinx documentation, which you can build locally on your copy of this branch with

python setup.py build_docs
open docs/_build/html/index.html

The narrative documentation starts with an example that @cefisher and I worked through today, where we demonstrate fitting a line with HELA (by predicting the slope and intercept parameters using a random forest).
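For intuition, the idea behind that example can be sketched without HELA at all: a random forest regressor learns the mapping from the sampled y-values of a noisy line back to its (slope, intercept) labels. A minimal standalone sketch using plain scikit-learn (illustrative only, not HELA's internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each training row is a line sampled on a fixed x grid, labeled by the
# (slope, intercept) pair that generated it
rng = np.random.default_rng(42)
n_lines, n_x = 2000, 100
x = np.linspace(0, 1, n_x)
slopes = rng.uniform(size=n_lines)
intercepts = rng.uniform(size=n_lines)
X = slopes[:, None] * x + intercepts[:, None]
X += rng.normal(scale=0.01, size=X.shape)  # small observational noise
y = np.column_stack([slopes, intercepts])

rf = RandomForestRegressor(n_estimators=100).fit(X, y)

# "Retrieve" the parameters of an unseen line
query = (0.7 * x + 0.5)[None, :]
print(rf.predict(query))  # approximately [[0.7, 0.5]]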

GitHub integrations/webhooks

After this PR is merged, we can update the GitHub webhooks/integrations to get Travis CI and Read the Docs up and running.

@bmorris3 (Member, Author) commented Nov 8, 2019

I've just added a first (very generously unrigorous) end-to-end test which you can run with

python setup.py test

It runs the linear regression example used in the docs and checks that:

  • the R^2 values are near unity for the slope and intercept
  • the slope and intercept are close to the true (input) values
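A rough sketch of what that test can look like (the thresholds and keyword arguments here are illustrative, not necessarily the exact ones in this PR):

import numpy as np

from hela import RandomForest, generate_example_data


def test_end_to_end():
    example_dir, training_dataset, samples_path = generate_example_data()
    rf = RandomForest(training_dataset, example_dir, samples_path)

    # R^2 should be near unity for both the slope and the intercept
    r2scores = rf.train(num_trees=1000, num_jobs=1)
    assert np.all(np.array(r2scores) > 0.9)

    # Posterior medians should land near the true (input) values; 0.7 and
    # 0.5 are stand-ins for whatever truths generate_example_data encodes
    posterior = rf.predict(plot_posterior=False)
    slope_med, intercept_med = np.median(posterior.samples, axis=0)
    assert abs(slope_med - 0.7) < 0.1
    assert abs(intercept_med - 0.5) < 0.1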

@bmorris3 (Member, Author) commented Nov 8, 2019

After a struggle, the package now passes its first continuous integration tests! See the passing Travis builds here, which include a format/style check (via flake8) and an end-to-end functional test of the entire module (via pytest). 🎉

@pmneila (Collaborator) left a comment

Hi @bmorris3,

Pleased to e-meet you as well :). I went quickly through your changes. It looks great.

I left some minor comments, especially regarding the names of a few functions/classes that we can discuss.

(Review comments on hela/forest.py, hela/models.py, and hela/dataset.py — outdated/resolved.)
@pmneila (Collaborator) left a comment

Hi @bmorris3

I left a few more comments. My biggest concerns are about the necessity of the class Model. I think we could get rid of it and just provide some auxiliary functions, since the main functionality is already in the RandomForestWrapper, including saving/loading. We can discuss it in the next meeting on Wednesday, if you are around.

I can take care of the implementation details if you want. But maybe it's better to wait until the user interface is clear.

(Review comments on hela/model.py — outdated/resolved.)

hela/model.py (outdated):
return ranges.T


def generate_example_data():
@pmneila (Collaborator):

This function is generating data with no noise and with independent slopes and ints, which makes the posterior computation pointless. I think generate_example_data should build an example dataset that shows the ability to estimate posteriors, which is the main strength of this library. Ideally, a dataset with some correlation between slopes and ints that shows up in the posterior?

@bmorris3 (Member, Author):

I've been trying to craft an example with correlations between slopes and ints but haven't been able to create a good working example. Would you be able to give me some more hints on how to demonstrate this?

@pmneila (Collaborator):

Hi @bmorris3,

Sorry, I have realized that my comment was very vague and unclear. Something like this might work:

import os
import json

import numpy as np


def generate_example_data():
    """
    Generate an example dataset in the new directory ``linear_dataset``.

    Returns
    -------
    training_dataset : str
        Path to the dataset metadata JSON file
    example_dir : str
        Path to the directory of the example data
    samples : `~numpy.ndarray`
        Example data to retrieve with the random forest
    """
    example_dir = 'linear_dataset'
    training_dataset = os.path.join(example_dir, 'example_dataset.json')

    os.makedirs(example_dir, exist_ok=True)

    # Save the dataset metadata to a JSON file
    dataset = {
        "metadata": {
            "names": ["slope", "intercept"],
            "ranges": [[0, 1], [0, 1]],
            "colors": ["#F14532", "#4a98c9"],
            "num_features": 1000
        },
        "training_data": "training.npy",
        "testing_data": "testing.npy"
    }

    with open(training_dataset, 'w') as fp:
        json.dump(dataset, fp)

    # Generate fake training data
    npoints = 20000

    slopes = np.random.uniform(size=npoints)
    ints = np.random.uniform(size=npoints)
    x = np.linspace(0, 1, 1000)[:, np.newaxis]

    # Add correlated noise to parameters to introduce degeneracies
    noise_ints = np.random.normal(scale=0.15, size=npoints)
    noise_slopes = np.abs(noise_ints) + np.random.normal(scale=0.02, size=npoints)

    # Add also noise to data points (not strictly necessary)
    data = ((slopes + noise_slopes) * x + (ints + noise_ints) +
            np.random.normal(scale=0.01, size=(1000, npoints)))

    labels = np.vstack([slopes, ints])
    X = np.vstack([data, labels])

    # Split dataset into training and testing segments
    training = X[:, :int(0.8 * npoints)].T
    testing = X[:, int(0.8 * npoints):].T

    np.save(os.path.join(example_dir, 'training.npy'), training)
    np.save(os.path.join(example_dir, 'testing.npy'), testing)

    # Generate a bunch of samples with a test value to "retrieve" with the
    # random forest:
    true_slope = 0.7
    true_intercept = 0.5

    samples = true_slope * x + true_intercept
    return training_dataset, example_dir, samples.T[0]

The posteriors would look like this:

[image: posterior matrix showing the correlated slope/intercept distributions]

(Review comments on hela/wrapper.py and hela/model.py — outdated/resolved.)

hela/model.py (outdated):
return np.array([forest_i.feature_importances_ for forest_i in forests])


class Model(object):
@pmneila (Collaborator):

I'm not sure about how the user should use this class. Is it that important to have a class like this? As a user of an API, I would expect to have more fine-grained control.

  1. It saves the model during training. While this makes sense on the command line, a user might now want to train and predict in a single session, with no saving involved. I think the user should be responsible for printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.
  2. It assumes the format of the dataset. The user might want to use a different format.
  3. It loads the saved model from a specific location every time predict is called, which makes no sense if the user does not need to save the model.
  4. It only allows predicting from a single data_file given in the constructor.

The code from the train method, for example, should be what the user might write to train a model and build the predicted_vs_real plots. But I don't see why we should give that as a method of a class.

@bmorris3 (Member, Author) commented Nov 12, 2019

You're right, we should definitely take a closer look at this class.

I think the user should be responsible for printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.

I don't feel strongly about this either way. I think there may still be use cases where the saving is helpful – for example when the training takes a very long time to run. Whatever you think makes sense 👍

It assumes the format of the dataset. The user might want to use a different format.

I'm not sure I understand – can't the user supply arbitrary x and y (in the correct shapes)?

It only allows predicting from a single data_file given in the constructor.

Good point. Could the solution to all of the above points be to not expose this class to the user, and just use it internally?

@pmneila (Collaborator):

I don't even think we need this class internally. Currently we expect the user to use the library like this:

import numpy as np
import matplotlib.pyplot as plt

from hela import Retrieval

# Here, Retrieval requires that the user uses our JSON format to describe
# the dataset. This is what I meant by "assumes the format of the dataset".
# They need to write a JSON file linking to NPY files that have a specific
# organization (first columns are features, last columns are labels). This is very
# restrictive.
r = Retrieval("dataset/dataset.json", "output_dir")

# Train the random forest
r2scores = r.train(num_trees=1000, num_jobs=1)
fig = r.plot_correlations()
fig.savefig("corr.pdf")
plt.show()

data = np.load("query.npy")
# Currently r.predict always loads the model from disk,
# which might be unnecessary.
posterior = r.predict(data)
# Do something with the posterior...

I find this very limited, but also unclear, as many non-obvious, implicit operations happen under the hood (saving the model and building R2 scores in train, loading the model and printing percentiles in predict, ...).

Instead, without Retrieval, the user could do this:

import numpy as np

import hela

# Load the dataset
# We offer this function for convenience, but this is optional. The user
# might write their own load_dataset, or read the data from NPY files.
dataset = hela.load_dataset("dataset/training_dataset.json")

# Train the model
# Again, this is a helper function. For more fine-grained control, the user
# can create a PosteriorRandomForest and call its fit method.
forest = hela.train_model(dataset, num_trees=1000, num_jobs=1)

# Print some info
print(forest.oob_score_)
...

# Save the model. Optional.
hela.save("output/model.pkl", forest)

# Run validation
dataset = hela.load_dataset("dataset/testing_dataset.json")
pred, r2scores = hela.test_model(forest, dataset)

# Plot
fig = hela.plot_predicted_vs_real(dataset.y, pred, dataset.names, dataset.ranges)
fig.savefig("output/r2.pdf")

# Prediction
data = np.load("query.npy")
# No load required here, but we can load a model with forest = hela.load("output/model.pkl")
posterior = forest.predict_posterior(data)
# or hela.predict_posterior(forest, data) to be consistent with other functions.
# Do something with the posterior...
posterior_ranges = hela.posterior_percentiles(posterior) # or posterior.percentiles()

I think this approach gives complete control to the user, and at the same time it is clearer, as every operation is explicit and only happens if the user asks for it.

If you are fine with the second approach and it suits the astropy developer standards, we can remove the class Retrieval and prepare the collection of helper functions (save, load...).

@bmorris3 (Member, Author):

Sorry for the long delay before I got back to you on this, it's been a busy few weeks!

I see the advantages to the above approach, but I'm still searching for a more object-oriented project structure and workflow. Let me see if I can split the difference between our proposals.

@bmorris3 (Member, Author):

The latest refactor looks like:

import matplotlib.pyplot as plt
from hela import Retrieval, Dataset, generate_example_data

# Create an example dataset
training_dataset, example_dir, example_data = generate_example_data()

# Load the dataset
dataset = Dataset.load_json("linear_dataset/example_dataset.json")

# Train the model
r = Retrieval()
r2scores = r.train(dataset, num_trees=1000, num_jobs=5)

# Print OOB score (optional)
print(r.oob)

# Plot predicted vs real:
fig, ax = r.plot_predicted_vs_real(dataset)

# Save the model (optional)
# from hela import save_model
# save_model("linear_dataset/model.pkl", r.model)

# Predict posterior distribution for slope and intercept of example data
posterior = r.predict(example_data)

# Print posterior ranges (optional)
posterior.print_percentiles(dataset.names)

fig2, ax2 = posterior.plot_posterior_matrix(dataset)

plt.show()

Is that a sufficient compromise between the explicit function calls you're looking for and the object-oriented structure I'm looking for? Again, happy to iterate 👍

(Review comment on hela/wrapper.py — outdated/resolved.)
@bmorris3 (Member, Author):

Hey @pmneila – perhaps we can move forward with this PR, if you have a minute?

@pmneila (Collaborator) commented Apr 15, 2020

Hi @bmorris3

Sure, let's finish this.

BTW, following some requests from @cefisher, I've implemented some additional functionality in HELA. However, this PR was created before those changes. How do you think we should proceed? Do we finish this PR with no extra functionality and then integrate the rest in subsequent PRs? I think that's the best option.

@bmorris3 (Member, Author):

I'm happy to go with your best judgement! Would that be much extra work for you? I don't know how big the merge conflicts are.

@pmneila (Collaborator) left a comment

The merge conflicts are going to be huge. 😅 After going through the code, I still think the best solution is to continue working on this PR and then I'll integrate the new functionality following this new API as much as possible.

There are a few things in the new code that you might find relevant at this point:

  1. It will be necessary to pass the training data at prediction time, so we can compute things like the best-fit model (@cefisher can give you more details on this if you want). One consequence is that we won't need to store names, ranges, and colors in the RandomForestPosterior, since they will be available at test time in the dataset. I think this is actually much better; I found it very weird that the forest should save that information.
  2. I split the classes PosteriorRandomForest (old Model) and Posterior into two different files. I don't think this is a problem, but maybe you want to keep everything in posteriors.py?
  3. I added type annotations. Not sure if that might be a problem with Astropy?

I went through the code and I think we are almost there. I added a few comments below.

@@ -18,7 +18,7 @@
POSTERIOR_MAX_SIZE = 10000


-def plot_predicted_vs_real(y_real, y_pred, names, ranges, alpha='auto'):
+def plot_predicted_vs_real(dataset, retrieval, alpha='auto'):
@pmneila (Collaborator):
retrieval is only used to get the prediction. Pass the prediction instead.

@@ -179,6 +180,46 @@ def __init__(self, samples, weights):
self.samples = samples
self.weights = weights

def print_percentiles(self, names):
@pmneila (Collaborator):

This function prints to the standard output. Code is much simpler if all functions with side effects (such as I/O) are present only in the high-level classes of the code (in this case, in the class Retrieval). Posterior is a low-level class, used internally to represent posteriors. A method like print_percentiles makes sense only in the context of the class Retrieval, but it is very odd for any other user of this class.

I understand that what you wanted to do is allow the final user to do something like:

posterior = retrieval.predict(example_data)
posterior.print_percentiles(dataset.names)

I can think of two possible solutions. The simplest one is changing the interface to something like:

posterior = retrieval.predict(example_data)
retrieval.print_percentiles(dataset.names, posterior)

This is also consistent with the fact that all other printing/plotting functions are in Retrieval.

The other solution, if you really want to keep the original API, would be encapsulating the posterior in a more specific class of posteriors for retrieval (RetrievalPosterior?) that contains all the methods that make sense in the context of the class Retrieval.
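A minimal sketch of that second option (the class name, attributes, and the shape returned by data_ranges are assumptions here):

class RetrievalPosterior:
    """A Posterior enriched with retrieval context (parameter names, etc.)."""

    def __init__(self, posterior, dataset):
        self.posterior = posterior  # low-level Posterior (samples, weights)
        self.dataset = dataset      # carries the names needed for printing

    def print_percentiles(self):
        # I/O lives here, keeping the low-level Posterior free of side effects
        for name, range_i in zip(self.dataset.names,
                                 self.posterior.data_ranges()):
            print("{}: {:.3g} [+{:.3g} -{:.3g}]".format(name, *range_i))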

[values[0], values[2] - values[0], values[0] - values[1]])
return ranges.T

def plot_posterior_matrix(self, dataset):
@pmneila (Collaborator):

As with print_percentiles, this function performs I/O in the class Posterior. The same concerns apply here.

Also, this function leads to a circular dependency in the code, since posteriors.py now depends on plot.py and vice versa.
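One conventional way to break the cycle (a sketch of the general pattern, not code from this PR) is to keep the matrix plot as a free function in plot.py that accepts a Posterior, so posteriors.py never has to import plot.py:

# hela/plot.py -- the dependency now runs one way only: plot.py consumes
# Posterior objects, and posteriors.py knows nothing about plot.py
import matplotlib.pyplot as plt


def plot_posterior_matrix(posterior, dataset):
    n_params = posterior.samples.shape[1]
    fig, axes = plt.subplots(n_params, n_params)
    for i in range(n_params):
        # 1D marginal distributions on the diagonal
        axes[i, i].hist(posterior.samples[:, i],
                        weights=posterior.weights, bins=30)
        axes[i, i].set_xlabel(dataset.names[i])
        # 2D joint distributions below the diagonal
        for j in range(i):
            axes[i, j].scatter(posterior.samples[:, j],
                               posterior.samples[:, i], s=1, alpha=0.2)
    return fig, axes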

"[+{:.3g} -{:.3g}]".format(name_i, *pred_range_i))

def data_ranges(self, percentiles=(50, 16, 84)):
"""
@pmneila (Collaborator):

This method is fine, but it seems a bit strange that it lives in the class Posterior while the function resample_posterior doesn't. Why not put them both as separate functions, or both as methods of this class? I have a slight preference for independent functions over classes, but if you prefer methods that's fine with me.
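For concreteness, the free-function version of both might look like this (a sketch; the weighted-percentile handling is simplified relative to whatever HELA actually does):

import numpy as np

from hela.posteriors import Posterior  # module path assumed


def data_ranges(posterior, percentiles=(50, 16, 84)):
    # Median and +/- offsets per parameter (ignoring weights for brevity)
    values = np.percentile(posterior.samples, percentiles, axis=0)
    return np.array([values[0],
                     values[2] - values[0],
                     values[0] - values[1]]).T


def resample_posterior(posterior, n_samples):
    # Draw an unweighted sample set from the weighted posterior
    p = posterior.weights / posterior.weights.sum()
    idx = np.random.choice(len(posterior.samples), size=n_samples, p=p)
    return Posterior(posterior.samples[idx], np.ones(n_samples))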


# saving model information...
self.oob = self.model.rf.oob_score_

-pred, r2scores = test_model(self.model, self.dataset)
+pred, r2scores = test_model(self.model, dataset)
self.pred = pred
return r2scores
@pmneila (Collaborator):

Why don't we also return pred? Not saying that we must do it, but if we return r2scores, it seems natural.
